Today we're going to apply the newly learned tools to the task of predicting job salary.

Special thanks to Oleg Vasilev for the core assignment idea.
The following were used in this lab:
wandb was used to visualize the results. A link to the result visualization is attached after each model's training run. All results are available at: https://wandb.ai/lukicheva/my-test-project?workspace=user-lukicheva
The best MSE and MAE were achieved by the AvgPool model that ignores PAD tokens.
print_graphics('/content/drive/MyDrive/results/loss.png')
print_graphics('/content/drive/MyDrive/results/mse.png')
print_graphics('/content/drive/MyDrive/results/mae.png')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
For starters, let's download and unpack the data from [here].
You can also get it from the Yandex.Disk URL on the competition page (pick Train_rev1.*).
from google.colab import drive
drive.mount('/content/drive')
!unzip drive/MyDrive/Train_rev1.zip
data = pd.read_csv("Train_rev1.csv", index_col=None)
data.shape
data.head()
One problem with salary prediction is that salaries are oddly distributed: many people are paid standard salaries, while a few get tons of money. The distribution is fat-tailed on the right side, which is inconvenient for MSE minimization.
There are several techniques to combat this: using a different loss function, predicting the log-target instead of the raw target, or even replacing targets with their percentiles among all salaries in the training set. We'll use the logarithm for now.
You can read more in the official description.
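The percentile idea mentioned above can be sketched with numpy (a toy helper, not used elsewhere in this notebook):

```python
import numpy as np

def to_percentiles(y):
    """Rank-transform targets into [0, 1]; the outlier no longer stretches the scale."""
    ranks = np.argsort(np.argsort(y))
    return ranks / (len(y) - 1)

salaries = np.array([20000.0, 25000.0, 30000.0, 35000.0, 1000000.0])
pct = to_percentiles(salaries)
# → [0.0, 0.25, 0.5, 0.75, 1.0]: the million-pound outlier maps to 1.0
```

Note the trade-off: predictions in percentile space must be mapped back through the empirical distribution to get actual salaries.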
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')
plt.figure(figsize=[8, 4])
plt.subplot(1, 2, 1)
plt.hist(data["SalaryNormalized"], bins=20);
plt.subplot(1, 2, 2)
plt.hist(data['Log1pSalary'], bins=20);
Our task is to predict one number, Log1pSalary.
To do so, our model can access a number of features:
Title and FullDescription (free text); Category, Company, LocationNormalized, ContractType, and ContractTime (categorical).
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"
data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"
data[text_columns] = data[text_columns].fillna('NaN')
data.sample(3)
Just like last week, applying NLP to a problem begins with tokenization: splitting raw text into sequences of tokens (words, punctuation, etc.).
Your task is to lowercase and tokenize all texts under Title and FullDescription columns. Store the tokenized data as a space-separated string of tokens for performance reasons.
It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay.
print("Raw text:")
print(data["FullDescription"][2::100000])
import nltk
tokenizer = nltk.tokenize.WordPunctTokenizer()
data[text_columns] = data[text_columns].applymap(lambda x: " ".join(tokenizer.tokenize(x.lower())))
Now we can assume that our text is a space-separated list of tokens:
print("Tokenized:")
print(data["FullDescription"][2::100000])
assert data["FullDescription"][2][:50] == 'mathematical modeller / simulation analyst / opera'
assert data["Title"][54321] == 'international digital account manager ( german )'
Not all words are equally useful. Some of them are typos or rare words that appear only a few times.
Let's count how many times each word occurs in the data so that we can build a "white list" of known words.
from collections import Counter
token_counts = Counter()
# Count how many times each token occurs in "Title" and "FullDescription" combined
for col in text_columns:
    for line in data[col].values:
        token_counts.update(line.split(" "))
print("Total unique tokens :", len(token_counts))
print('\n'.join(map(str, token_counts.most_common(n=5))))
print('...')
print('\n'.join(map(str, token_counts.most_common()[-3:])))
assert token_counts.most_common(1)[0][1] in range(2600000, 2700000)
assert len(token_counts) in range(200000, 210000)
print('Correct!')
# Let's see how many words there are for each count
plt.hist(list(token_counts.values()), range=[0, 10**4], bins=50, log=True)
plt.xlabel("Word counts");
Task 1.1 Get a list of all tokens that occur at least 10 times.
min_count = 10
# tokens that occurred at least min_count times throughout the dataset
tokens = sorted(t for t, c in token_counts.items() if c >= min_count)
# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens
print("Vocabulary size:", len(tokens))
assert type(tokens) == list
assert len(tokens) in range(32000, 35000)
assert 'me' in tokens
assert UNK in tokens
print("Correct!")
Task 1.2 Build an inverse token index: a dictionary from token (string) to its index in tokens (int).
token_to_id = {t: i for i, t in enumerate(tokens)}
assert isinstance(token_to_id, dict)
assert len(token_to_id) == len(tokens)
for tok in tokens:
    assert tokens[token_to_id[tok]] == tok
print("Correct!")
And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices.
UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])
def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i, seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    return matrix
print("Lines:")
print('\n'.join(data["Title"][::100000].values), end='\n\n')
print("Matrix:")
print(as_matrix(data["Title"][::100000]))
Now let's encode the categorical data we have.
As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc.
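For illustration, one of those alternatives, a simple frequency encoding, can be sketched like this (toy data; the notebook itself sticks with one-hot):

```python
from collections import Counter

# toy company column; a frequency encoding maps each category to its share of rows
companies = ["Acme", "Acme", "Globex", "Acme", "Initech"]
freq = Counter(companies)
n = len(companies)
freq_encoded = [freq[c] / n for c in companies]
# → [0.6, 0.6, 0.2, 0.6, 0.2]
```

Unlike one-hot, this produces a single column regardless of cardinality, at the cost of collapsing categories that happen to share a frequency.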
from sklearn.feature_extraction import DictVectorizer
# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")
categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))
!pip install wandb
!wandb login
import wandb
Once we've learned to tokenize the data, let's design a machine learning experiment.
As before, we won't focus too much on validation, opting for a simple train-test split.
To be completely rigorous, we've committed a small crime here: we used the whole dataset for tokenization and vocabulary building. A stricter approach would be to do that on the training set only. You may want to try that and measure the magnitude of the change.
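The stricter variant can be sketched in a few lines (a hypothetical toy split, not wired into the rest of the notebook):

```python
from collections import Counter

# hypothetical split: only train_texts may influence the vocabulary
train_texts = ["senior developer role", "developer wanted"]
val_texts = ["junior developer"]

counts = Counter(tok for text in train_texts for tok in text.split())
vocab = {"UNK": 0, "PAD": 1}
for tok, cnt in sorted(counts.items()):
    if cnt >= 1:  # stand-in for min_count (10 in this notebook)
        vocab.setdefault(tok, len(vocab))
# "junior" appears only in the validation set, so it stays out and will map to UNK
```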
from sklearn.model_selection import train_test_split
data_train, data_val = train_test_split(data, test_size=0.1, random_state=77)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))
print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))
import torch
def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors
def make_batch(data, max_len=None, word_dropout=0, device=torch.device('cpu')):
    """
    Creates a model-friendly dict from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title': int64[batch, title_max_len], ...}
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values
    return to_tensors(batch, device)
def apply_word_dropout(matrix, keep_prob, replace_with=UNK_IX, pad_ix=PAD_IX):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prob, 1 - keep_prob])
    dropout_mask &= matrix != pad_ix  # never drop the padding itself
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])
make_batch(data_train[:3], max_len=10)
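The word-dropout trick used in make_batch can be checked in isolation with a small standalone re-implementation (the *_DEMO names are local to this sketch, not part of the notebook's pipeline):

```python
import numpy as np

UNK_IX_DEMO, PAD_IX_DEMO = 0, 1  # same index convention as the notebook

def word_dropout_demo(matrix, keep_prob, rng):
    """Replace non-PAD tokens with UNK with probability 1 - keep_prob."""
    drop = rng.random(matrix.shape) >= keep_prob
    drop &= matrix != PAD_IX_DEMO  # PAD positions are never touched
    return np.where(drop, UNK_IX_DEMO, matrix)

m = np.array([[5, 6, PAD_IX_DEMO]])
dropped = word_dropout_demo(m, keep_prob=0.0, rng=np.random.default_rng(0))
# with keep_prob=0 every real token becomes UNK while PAD survives: [[0, 0, 1]]
```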
Our basic model consists of three branches:
We will then feed all 3 branches into one common network that predicts salary.

This clearly doesn't fit into a keras-style Sequential interface. To build such a network, we'll use PyTorch's modular nn.Module API.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
As usual, we're going to feed our model random minibatches of data.
As we train, we want to monitor not only the loss function, which is computed in log-space, but also the actual error measured in dollars.
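To build intuition for what a log-space error means in money terms, here is a small numpy check (toy numbers; the conversion mirrors the log1p target transform above):

```python
import numpy as np

# an error of 0.1 in log-space translates to a salary-dependent error in
# currency units once we undo the log1p transform with expm1
y_true_log = np.log1p(np.array([30000.0, 45000.0]))
y_pred_log = y_true_log + 0.1  # pretend the model is off by 0.1 in log-space
currency_error = np.abs(np.expm1(y_pred_log) - np.expm1(y_true_log))
# the same log-space error costs more at higher salaries
```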
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)
        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)
            yield batch
        if not cycle:
            break
We can now fit our model the usual minibatch way. The interesting part is that we train on a stream of minibatches produced by the iterate_minibatches function.
import tqdm.notebook
BATCH_SIZE = 128
EPOCHS = 3
DEVICE = torch.device('cuda')
LEARNING_RATE = 1e-3
wandb.config = {
    "learning_rate": LEARNING_RATE,
    "epochs": EPOCHS,
    "batch_size": BATCH_SIZE
}
def print_metrics(model, data, batch_size=BATCH_SIZE, name="", **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw):
            batch_pred = model(batch)
            # sum (rather than average) the per-batch errors, so that dividing by
            # the total number of samples gives correctly weighted metrics
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae
def run_model(model, name='model'):
    criterion = nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    wandb.init(project="my-test-project", entity="lukicheva", name=name)
    wandb.watch(model)
    for epoch in range(EPOCHS):
        print(f"epoch: {epoch}")
        for i, batch in tqdm.notebook.tqdm(
                enumerate(iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
                total=len(data_train) // BATCH_SIZE):
            model.train()
            pred = model(batch)
            optimizer.zero_grad()
            loss = criterion(pred, batch[TARGET_COLUMN])
            loss.backward()
            optimizer.step()
            if i % 100 == 99:
                print('train_loss', loss.item())
                mse, mae = print_metrics(model, data_val, name='val', batch_size=BATCH_SIZE, device=DEVICE)
                wandb.log({'train_loss': loss.item(), 'mse': mse, 'mae': mae})
model = SalaryPredictor().to(DEVICE)
run_model(model, name='SP')
In this part, models with convolutional layers were trained, with the following variations:
import cv2 as cv
from google.colab.patches import cv2_imshow
def print_graphics(path):
    img = cv.imread(path)
    img = np.float32(img)
    cv2_imshow(img)
wandb.config = {
    "learning_rate": LEARNING_RATE,
    "epochs": EPOCHS,
    "batch_size": BATCH_SIZE
}
class SalaryPredictorBatchNorm(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.BatchNorm1d(hid_size),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.BatchNorm1d(hid_size),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
model_batch_norm = SalaryPredictorBatchNorm().to(DEVICE)
run_model(model_batch_norm, name='SPBatchNorm')
class SalaryPredictorLayerNorm(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder_conv = nn.Conv1d(hid_size, hid_size, kernel_size=2)
        self.title_encoder_layer_norm = nn.Sequential(
            nn.LayerNorm(hid_size),
            nn.Dropout(p=0.25),
            nn.ReLU(),
        )
        self.title_encoder = nn.AdaptiveMaxPool1d(output_size=1)
        self.description_encoder_conv = nn.Conv1d(hid_size, hid_size, kernel_size=2)
        self.description_encoder_layer_norm = nn.Sequential(
            nn.LayerNorm(hid_size),
            nn.Dropout(p=0.25),
            nn.ReLU(),
        )
        self.description_encoder = nn.AdaptiveMaxPool1d(output_size=1)
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.LayerNorm(hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.LayerNorm(hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.LayerNorm(hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        # LayerNorm normalizes the channel dim, so we permute to [batch, time, channels] and back
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder_conv(title_embeddings).permute(0, 2, 1)
        title_features = self.title_encoder_layer_norm(title_features).permute(0, 2, 1)
        title_features = self.title_encoder(title_features).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder_conv(description_embeddings).permute(0, 2, 1)
        description_features = self.description_encoder_layer_norm(description_features).permute(0, 2, 1)
        description_features = self.description_encoder(description_features).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
model_layer_norm = SalaryPredictorLayerNorm().to(DEVICE)
run_model(model_layer_norm, name='SPLayerNorm')
class SalaryPredictorParallelConv(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2, groups=8),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(hid_size, hid_size, kernel_size=2, groups=8),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2, groups=8),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(hid_size, hid_size, kernel_size=2, groups=8),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
model_parallel_conv = SalaryPredictorParallelConv().to(DEVICE)
run_model(model_parallel_conv, name='SPParallelConv')
class SalaryPredictorMoreLayers(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size * 2, kernel_size=2),
            nn.BatchNorm1d(hid_size * 2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(hid_size * 2, hid_size, kernel_size=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size * 2, kernel_size=2),
            nn.BatchNorm1d(hid_size * 2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(hid_size * 2, hid_size, kernel_size=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 8),
            nn.ReLU(),
            nn.Linear(hid_size * 8, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
model_more_layers = SalaryPredictorMoreLayers().to(DEVICE)
run_model(model_more_layers, name='SPMoreLayers')
!pip install pytorchtools
def run_model_early_stopping(model, name='model'):
    criterion = nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    wandb.init(project="my-test-project", entity="lukicheva", name=name)
    wandb.watch(model)
    # early stopping state
    last_loss = float('inf')
    patience = 2
    trigger_times = 0
    for epoch in range(EPOCHS):
        valid_losses = []
        print(f"epoch: {epoch}")
        for i, batch in tqdm.notebook.tqdm(
                enumerate(iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=DEVICE)),
                total=len(data_train) // BATCH_SIZE):
            model.train()
            pred = model(batch)
            optimizer.zero_grad()
            train_loss = criterion(pred, batch[TARGET_COLUMN])
            train_loss.backward()
            optimizer.step()
            if i % 100 == 99:
                model.eval()
                with torch.no_grad():
                    # use a separate variable so the training batch isn't clobbered
                    for val_batch in iterate_minibatches(data_val, batch_size=BATCH_SIZE, shuffle=False, device=DEVICE):
                        val_pred = model(val_batch)
                        # record validation loss
                        valid_losses.append(criterion(val_pred, val_batch[TARGET_COLUMN]).item())
                print('train_loss', train_loss.item())
                mse, mae = print_metrics(model, data_val, name='val', batch_size=BATCH_SIZE, device=DEVICE)
                wandb.log({'train_loss': train_loss.item(), 'mse': mse, 'mae': mae})
        current_loss = np.average(valid_losses)
        if current_loss > last_loss:
            trigger_times += 1
            print('Trigger Times:', trigger_times)
            if trigger_times >= patience:
                print('Early stopping!')
                return
        else:
            print('trigger times: 0')
            trigger_times = 0
        last_loss = current_loss
early_stop_model = SalaryPredictorMoreLayers().to(DEVICE)
run_model_early_stopping(early_stop_model, name='SPEarlyStopping')
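The patience logic inside run_model_early_stopping boils down to this standalone helper (a sketch of the idea, not the exact code above):

```python
def should_stop(val_losses, patience=2):
    """Stop once validation loss has risen `patience` times in a row."""
    rises = 0
    for prev, cur in zip(val_losses, val_losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False

should_stop([0.5, 0.4, 0.45, 0.5])  # → True: two consecutive rises
```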
The most successful models in this part were the LayerNorm one and the one with several convolutional layers: they achieved the lowest MSE and MAE. The LayerNorm model also shows a smooth training curve, which suggests stable learning.
print_graphics('/content/drive/MyDrive/results/loss_conv.png')
print_graphics('/content/drive/MyDrive/results/mse_conv.png')
print_graphics('/content/drive/MyDrive/results/mae_conv.png')
Two models were used in this part: one with MaxPool and one with AveragePool that ignores PAD tokens.
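The PAD-aware average pooling can be illustrated with a numpy-only sketch before the full model (assuming PAD index 1, as in this notebook):

```python
import numpy as np

def masked_avg_pool(feats, token_ids, pad_ix=1):
    """Mean over the time axis, ignoring PAD positions.
    feats: [batch, time, channels], token_ids: [batch, time]."""
    mask = (token_ids != pad_ix).astype(feats.dtype)        # [batch, time]
    denom = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # guard all-PAD rows
    return (feats * mask[:, :, None]).sum(axis=1) / denom

feats = np.array([[[2.0], [4.0], [100.0]]])  # the last time step is padding
ids = np.array([[5, 7, 1]])
pooled = masked_avg_pool(feats, ids)
# → [[3.0]]: the padded position's value (100.0) does not leak into the average
```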
class SalaryPredictorMaxPool(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.MaxPool1d(kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.MaxPool1d(kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
max_pool_model = SalaryPredictorMaxPool().to(DEVICE)
run_model(max_pool_model, name='SPMaxPool')
class SalaryPredictorAvPool(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        self.title_conv = nn.Conv1d(hid_size, hid_size, kernel_size=1)
        self.title_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU()
        )
        self.description_conv = nn.Conv1d(hid_size, hid_size, kernel_size=1)
        self.description_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU()
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_conv = self.title_conv(title_embeddings).permute(0, 2, 1)
        # average pool over time, excluding PAD positions
        mask = batch['Title'] != PAD_IX
        denom = torch.sum(mask, -1, keepdim=True)
        feat = torch.sum(title_conv * mask.unsqueeze(-1), dim=1) / denom
        title_features = self.title_encoder(feat)
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_conv = self.description_conv(description_embeddings).permute(0, 2, 1)
        # same PAD-aware average pool; the mask and features must come from
        # FullDescription here, not from Title
        mask = batch['FullDescription'] != PAD_IX
        denom = torch.sum(mask, -1, keepdim=True)
        feat = torch.sum(description_conv * mask.unsqueeze(-1), dim=1) / denom
        description_features = self.description_encoder(feat)
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
av_pool_model = SalaryPredictorAvPool().to(DEVICE)
run_model(av_pool_model, name='SPAvPool')
During training, the validation loss of both models fluctuated in roughly the same range, from 0.08 to 0.18. The MSE and MAE metrics, however, differed: the AveragePool model performed better, with a smooth curve and lower validation values than the MaxPool model (MSE: 0.44 for MaxPool vs 0.28 for AveragePool; MAE: 0.32 vs 0.06).
print_graphics('/content/drive/MyDrive/results/loss_pool.png')
print_graphics('/content/drive/MyDrive/results/mse_pool.png')
print_graphics('/content/drive/MyDrive/results/mae_pool.png')
In this part, models with frozen and unfrozen pretrained embeddings were used:
import gensim.downloader as api
# Download pretrained embeddings: glove-twitter-25, word2vec-google-news-300, glove-wiki-gigaword-50
glove_twitter_vectors = torch.FloatTensor(api.load('glove-twitter-25').vectors).to(DEVICE)
word2vec_vectors = torch.FloatTensor(api.load('word2vec-google-news-300').vectors).to(DEVICE)
glove_wiki_vectors = torch.FloatTensor(api.load('glove-wiki-gigaword-50').vectors).to(DEVICE)
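One caveat with the matrices above: their rows follow gensim's own vocabulary order, not our token_to_id, so ideally the vectors would be re-aligned to our token list before feeding them to the embedder. A sketch of that alignment in numpy (toy vectors stand in for the real KeyedVectors; the variable names here are illustrative only):

```python
import numpy as np

# toy stand-in for a gensim KeyedVectors lookup: word -> vector
pretrained = {"manager": np.array([1.0, 0.0]), "digital": np.array([0.0, 1.0])}
dim = 2
vocab = ["UNK", "PAD", "digital", "manager", "rareword"]

weights = np.zeros((len(vocab), dim), dtype=np.float32)
rng = np.random.default_rng(0)
for i, tok in enumerate(vocab):
    if tok in pretrained:
        weights[i] = pretrained[tok]                  # reuse the pretrained row
    else:
        weights[i] = rng.normal(scale=0.1, size=dim)  # random init for OOV tokens

# `weights` could then seed torch.nn.Embedding.from_pretrained(torch.tensor(weights))
```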
class SalaryPredictorPretrainedEmbedings(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8, pretrained_weigths=None, freeze=False):
        super().__init__()
        # seed the embedding table with the pretrained vectors;
        # freeze=True keeps them fixed during training
        self.embedder = nn.Embedding.from_pretrained(pretrained_weigths, freeze=freeze)
        self.title_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder = nn.Sequential(
            nn.Conv1d(hid_size, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title']).permute(0, 2, 1)
        title_features = self.title_encoder(title_embeddings).squeeze()
        description_embeddings = self.embedder(batch['FullDescription']).permute(0, 2, 1)
        description_features = self.description_encoder(description_embeddings).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
hid_size = glove_twitter_vectors.shape[1]
glove_vectors_model_freeze = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=glove_twitter_vectors, freeze=True).to(DEVICE)
run_model(glove_vectors_model_freeze, name='SPGloveVectorsFreeze')
glove_vectors_model = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=glove_twitter_vectors, freeze=False).to(DEVICE)
run_model(glove_vectors_model, name='SPGloveVectors')
Link to the plot: https://wandb.ai/lukicheva/my-test-project/runs/1p1nd6wv
hid_size = word2vec_vectors.shape[1]
word2vec_vectors_model_freeze = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=word2vec_vectors, freeze=True).to(DEVICE)
run_model(word2vec_vectors_model_freeze, name='SPWord2VecVectorsFreeze')
word2vec_vectors_model = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=word2vec_vectors, freeze=False).to(DEVICE)
run_model(word2vec_vectors_model, name='SPWord2VecVectors')
hid_size = glove_wiki_vectors.shape[1]
glove_wiki_vectors_model_freeze = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=glove_wiki_vectors, freeze=True).to(DEVICE)
run_model(glove_wiki_vectors_model_freeze, name='SPGloveWikiVectorsFreeze')
glove_wiki_vectors_model = SalaryPredictorPretrainedEmbedings(hid_size=hid_size, pretrained_weigths=glove_wiki_vectors, freeze=False).to(DEVICE)
run_model(glove_wiki_vectors_model, name='SPGloveWikiVectors')
The models with pretrained embeddings performed poorly: validation loss, MSE, and MAE fluctuated heavily regardless of whether the embeddings were frozen. Possibly these embeddings are simply a poor match for this data, though that is only a guess.
print_graphics('/content/drive/MyDrive/results/loss_emb.png')
print_graphics('/content/drive/MyDrive/results/mse_emb.png')
print_graphics('/content/drive/MyDrive/results/mae_emb.png')
In this part of the lab, the models use recurrent LSTM and GRU layers, as well as their combination with a convolutional layer.
class SalaryPredictorLSTM(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8, num_layers=2, bidirectional=False):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        # batch_first=True so the LSTM accepts [batch, time, features] from the embedder
        self.title_encoder_LSTM = nn.LSTM(hid_size, hid_size, num_layers, batch_first=True, bidirectional=bidirectional)
        self.title_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder_LSTM = nn.LSTM(hid_size, hid_size, num_layers, batch_first=True, bidirectional=bidirectional)
        self.description_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        # bidirectional RNNs produce twice as many features per text branch
        num_ = 6 if bidirectional else 4
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * num_, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title'])
        title_features, (hn, cn) = self.title_encoder_LSTM(title_embeddings)
        title_features = self.title_encoder(title_features.permute(0, 2, 1)).squeeze()
        description_embeddings = self.embedder(batch['FullDescription'])
        description_features, (hn, cn) = self.description_encoder_LSTM(description_embeddings)
        description_features = self.description_encoder(description_features.permute(0, 2, 1)).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
LSTM_model = SalaryPredictorLSTM().to(DEVICE)
run_model(LSTM_model, name='SPLSTM')
LSTM_model_bidirectional = SalaryPredictorLSTM(bidirectional=True).to(DEVICE)
run_model(LSTM_model_bidirectional, name='SPLSTMBidirectional')
class SalaryPredictorGRU(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=8, bidirectional=False):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        # batch_first=True so the GRU accepts [batch, time, features] from the embedder
        self.title_encoder_GRU = nn.GRU(hid_size, hid_size, batch_first=True, bidirectional=bidirectional)
        self.title_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder_GRU = nn.GRU(hid_size, hid_size, batch_first=True, bidirectional=bidirectional)
        self.description_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        # bidirectional RNNs produce twice as many features per text branch
        num_ = 6 if bidirectional else 4
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * num_, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title'])
        title_features, hn = self.title_encoder_GRU(title_embeddings)
        title_features = self.title_encoder(title_features.permute(0, 2, 1)).squeeze()
        description_embeddings = self.embedder(batch['FullDescription'])
        description_features, hn = self.description_encoder_GRU(description_embeddings)
        description_features = self.description_encoder(description_features.permute(0, 2, 1)).squeeze()
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze()
GRU_model = SalaryPredictorGRU().to(DEVICE)
run_model(GRU_model, name='SPGRU')
GRU_model_bidirectional = SalaryPredictorGRU(bidirectional=True).to(DEVICE)
run_model(GRU_model_bidirectional, name='SPGRUBidirectional')
class SalaryPredictorConvRnn(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_),
                 hid_size=8, bidirectional=False):
        super().__init__()
        self.embedder = nn.Embedding(n_tokens, hid_size)
        # a bidirectional GRU emits 2 * hid_size channels per timestep,
        # so the conv layers must accept that many input channels
        rnn_out = hid_size * (2 if bidirectional else 1)
        self.title_encoder_GRU = nn.GRU(hid_size, hid_size,
                                        batch_first=True, bidirectional=bidirectional)
        self.title_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(rnn_out, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.description_encoder_GRU = nn.GRU(hid_size, hid_size,
                                              batch_first=True, bidirectional=bidirectional)
        self.description_encoder = nn.Sequential(
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.Conv1d(rnn_out, hid_size, kernel_size=2),
            nn.Dropout(p=0.25),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(output_size=1)
        )
        self.categorical_encoder = nn.Sequential(
            nn.Linear(n_cat_features, hid_size * 2),
            nn.ReLU(),
            nn.Linear(hid_size * 2, hid_size * 2),
            nn.ReLU()
        )
        # the conv layers project back to hid_size channels, so the concatenated
        # feature size is hid + hid + 2*hid = 4*hid in both directional modes
        self.final_predictor = nn.Sequential(
            nn.Linear(hid_size * 4, hid_size),
            nn.ReLU(),
            nn.Linear(hid_size, 1)
        )

    def forward(self, batch):
        title_embeddings = self.embedder(batch['Title'])
        title_features, _ = self.title_encoder_GRU(title_embeddings)
        title_features = self.title_encoder(title_features.permute(0, 2, 1)).squeeze(-1)
        description_embeddings = self.embedder(batch['FullDescription'])
        description_features, _ = self.description_encoder_GRU(description_embeddings)
        description_features = self.description_encoder(description_features.permute(0, 2, 1)).squeeze(-1)
        categorical_features = self.categorical_encoder(batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.final_predictor(features).squeeze(-1)
conv_and_rnn_model = SalaryPredictorConvRnn().to(DEVICE)
run_model(conv_and_rnn_model, name='SPConvAndRnn')
The best model turned out to be the one with a unidirectional LSTM layer. I was unable to find a good combination of convolutional and recurrent layers: that model showed the worst result among the recurrent architectures. The metrics of the GRU models fluctuate heavily throughout training, regardless of whether bidirectionality is enabled.
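One way to make jumpy training curves easier to compare by eye is to plot a running mean of the logged metric instead of the raw values. A small numpy sketch (the window size here is an arbitrary choice, not taken from the experiments above):

```python
import numpy as np

def smooth(values, window=5):
    """Simple moving average; keeps only fully-overlapping windows."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

noisy = np.array([1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0])
print(smooth(noisy, window=3))  # [2. 3. 3. 4. 4.]
```

The output is shorter than the input by `window - 1` points, which is worth remembering when aligning the smoothed curve with the epoch axis.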
print_graphics('/content/drive/MyDrive/results/loss_rnn.png')
print_graphics('/content/drive/MyDrive/results/mse_rnn.png')
print_graphics('/content/drive/MyDrive/results/mae_rnn.png')
In this part I extracted features from the penultimate layer using the model's children() method and fed them into a RandomForestRegressor.
class FeatureExtractor(nn.Module):
    """Runs the trained model up to (and including) the first layer of final_predictor."""
    def __init__(self, model):
        super().__init__()
        # children() yields submodules in definition order:
        # [embedder, title_encoder, description_encoder, categorical_encoder, final_predictor]
        self.seq = nn.Sequential(*list(model.children())[:-1])
        self.last_layer = model.final_predictor[0]

    def forward(self, batch):
        title_embeddings = self.seq[0](batch['Title']).permute(0, 2, 1)
        title_features = self.seq[1](title_embeddings).squeeze(-1)
        description_embeddings = self.seq[0](batch['FullDescription']).permute(0, 2, 1)
        description_features = self.seq[2](description_embeddings).squeeze(-1)
        categorical_features = self.seq[3](batch['Categorical'])
        features = torch.cat(
            [title_features, description_features, categorical_features], dim=1)
        return self.last_layer(features)

feature_extractor = FeatureExtractor(model)
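The slicing above is predictable because `nn.Module.children()` yields submodules in the order they were assigned in `__init__`. A toy illustration (the module and attribute names are made up for this example):

```python
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(4, 4)
        self.second = nn.ReLU()
        self.head = nn.Linear(4, 1)

toy = Toy()
# drop the last child ("head"), keeping the definition order of the rest
trunk = list(toy.children())[:-1]
print([type(m).__name__ for m in trunk])  # ['Linear', 'ReLU']
```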
def get_features(data):
    batches = []
    with torch.no_grad():  # inference only: no need to track gradients
        for batch in tqdm.notebook.tqdm(
                iterate_minibatches(data, batch_size=BATCH_SIZE, device=DEVICE),
                total=len(data) // BATCH_SIZE):
            # collect per-batch features on CPU, concatenate once at the end
            batches.append(feature_extractor(batch).cpu())
    return torch.cat(batches, dim=0)
BATCH_SIZE = 128
out_features_val = get_features(data_val).detach().numpy()
out_features_train = get_features(data_train).detach().numpy()
out_features_train.shape
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100,
max_depth=50,
min_samples_leaf=4,
min_samples_split=10)
rf_model.fit(out_features_train, data_train[TARGET_COLUMN].to_numpy())
y_pred = rf_model.predict(out_features_val)
y_pred
# np.mean already averages over the validation set, so no extra division by len() is needed
MSE = np.mean(np.square(y_pred - data_val[TARGET_COLUMN].to_numpy()))
MAE = np.mean(np.abs(y_pred - data_val[TARGET_COLUMN].to_numpy()))
MSE
MAE
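MSE and MAE here are plain means over the validation examples (`np.mean` already divides by the number of samples, so no further normalization is required). A quick numpy sanity check on toy values:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 3.0, 2.0])

mse = np.mean(np.square(y_pred - y_true))  # (0 + 1 + 4) / 3
mae = np.mean(np.abs(y_pred - y_true))     # (0 + 1 + 2) / 3
print(mse, mae)
```

The same numbers can be cross-checked with `sklearn.metrics.mean_squared_error` and `mean_absolute_error` if sklearn is already imported.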
Judging by MSE and MAE, this variant produced one of the best results relative to the other models.